Take-home Exercise 3

Visual Analytics of Resale Prices of Singapore Public Housing Properties

Author

Michael Djohan

Published

February 5, 2023

Modified

February 7, 2023

1. Overview

This exercise aims to uncover the salient patterns of the resale prices of public housing property by residential towns and estates in Singapore using appropriate analytical visualisation techniques. The visualization is designed using ggplot2, its extensions, and tidyverse packages.

The original dataset was downloaded from Data.gov.sg titled Resale flat princes based on registration date from Jan-2017 onwards.

The file downloaded was resale-flat-prices-based-on-registration-date-from-jan-2017-onwards.csv

The focus of the study is on 3-ROOM, 4-ROOM and 5-ROOM types for 2022 period.

2. Data Preparation

2.1 Install R packages and import dataset

The code chunk below uses pacman::p_load() to check if packages are installed. If they are, they will be launched into R. The packages installed are

  • plotly: Used for creating interactive web-based graphs.

  • ggstatsplot: Used for creating graphics with details from statistical tests.

  • knitr: Used for dynamic report generation

  • tidyverse: A collection of core packages designed for data science, used extensively for data preparation and wrangling.

    All packages can be found within CRAN.

Import data from csv using readr::read_csv() and store it in variable flatprice.

#Load packages
pacman::p_load(plotly, ggstatsplot, knitr, tidyverse)

#Import data
flatprice <- read_csv("data/resale-flat-prices-based-on-registration-date-from-jan-2017-onwards.csv", show_col_types = FALSE)

2.2 Data wrangling

Looking at the data below, we notice few problems

  • month is in <chr> format (“yyyy-mm”), which is not very useful for filtering for 2022 period

  • lease_commence_date is in <dbl> format. It needs to be converted to <int>

  • remaining_lease is in <chr> format. It needs to be reformatted to <dbl> in years

flatprice
# A tibble: 146,215 × 11
   month   town    flat_…¹ block stree…² store…³ floor…⁴ flat_…⁵ lease…⁶ remai…⁷
   <chr>   <chr>   <chr>   <chr> <chr>   <chr>     <dbl> <chr>     <dbl> <chr>  
 1 2017-01 ANG MO… 2 ROOM  406   ANG MO… 10 TO …      44 Improv…    1979 61 yea…
 2 2017-01 ANG MO… 3 ROOM  108   ANG MO… 01 TO …      67 New Ge…    1978 60 yea…
 3 2017-01 ANG MO… 3 ROOM  602   ANG MO… 01 TO …      67 New Ge…    1980 62 yea…
 4 2017-01 ANG MO… 3 ROOM  465   ANG MO… 04 TO …      68 New Ge…    1980 62 yea…
 5 2017-01 ANG MO… 3 ROOM  601   ANG MO… 01 TO …      67 New Ge…    1980 62 yea…
 6 2017-01 ANG MO… 3 ROOM  150   ANG MO… 01 TO …      68 New Ge…    1981 63 yea…
 7 2017-01 ANG MO… 3 ROOM  447   ANG MO… 04 TO …      68 New Ge…    1979 61 yea…
 8 2017-01 ANG MO… 3 ROOM  218   ANG MO… 04 TO …      67 New Ge…    1976 58 yea…
 9 2017-01 ANG MO… 3 ROOM  447   ANG MO… 04 TO …      68 New Ge…    1979 61 yea…
10 2017-01 ANG MO… 3 ROOM  571   ANG MO… 01 TO …      67 New Ge…    1979 61 yea…
# … with 146,205 more rows, 1 more variable: resale_price <dbl>, and
#   abbreviated variable names ¹​flat_type, ²​street_name, ³​storey_range,
#   ⁴​floor_area_sqm, ⁵​flat_model, ⁶​lease_commence_date, ⁷​remaining_lease

is.na() function is also used to confirm that there are no missing values in the flatprice dataset.

#Check for missing values
any(is.na(flatprice))
[1] FALSE

The code chunk below performs the required data wrangling to clean flatprice dataset and store it in new variable flatpriceclean.

  • Filter flat_type for 3 ROOM, 4 ROOM, and 5 ROOM as this is the scope of the study using dplyr::filter()
  • Convert the month variable to date using as.Date(). Store the year and month to the respective new variables year and month using dplyr:mutate(). They can then be converted to integer using as.integer(). Afterwards, we can filter the year variable to 2022 using dplyr::filter(), which is the scope of the study
  • Extract the year and month digits from remaining_lease variable using str_extract() function. Sum the year digit and (month digit/12) to obtain the years of remaining lease and convert it to <dbl> format using as.numeric(). They are then rounded to 1 decimal place using round(). Assign new variable called remaining_lease_years using dplyr::mutate()
  • Create new variable called resale_price_persqm to divide the resale_price with floor_area_sqm. This is performed to normalize the resale price to flat area. They are then rounded to 1 decimal place using round(). The new variable is assigned using dplyr::mutate
  • Convert lease_commence_date to <int> using as.integer()
#Data preparation
#store the new dataset in new variable flatpriceclean
flatpriceclean <- flatprice |> 
  
#filter for 3-ROOM, 4-ROOM, 5-ROOM
  filter(flat_type %in% c('3 ROOM','4 ROOM','5 ROOM')) |> 
  
#reformat month and split it to month and year. Use year to filter for 2022
  mutate(year = as.integer(format(as.Date(paste(month, "-01", sep="")), "%Y")),
         month = as.integer(format(as.Date(paste(month, "-01", sep="")), "%m")),
         .before = 1)|> 
  filter(year == 2022) |> 

#mutate remaining_lease to remaining_lease_years
  mutate(remaining_lease_years = round((as.numeric(str_extract(remaining_lease, "^[0-9]+")) + 
                           ifelse(is.na(as.numeric(str_extract(remaining_lease, " [0-9]+"))), 0, as.numeric(str_extract(remaining_lease, " [0-9]+")))/12), digits = 1),
         
#create new variable called resale_price_persqm
         resale_price_persqm = round(resale_price/floor_area_sqm, digits = 1),

#convert lease_commence_date to integer
         lease_commence_date = as.integer(lease_commence_date),
         .after = remaining_lease) 

The final dataset flatpriceclean is displayed below.

kable(head(flatpriceclean), "simple")
year month t own f lat_type b lock s treet_name s torey_range floor_area_sqm f lat_model lease_commence_date r emaining_lease remaining_lease_years resale_price_persqm resale_price
2022 1 ANG MO KIO 3 ROOM 320 ANG MO KIO AVE 1 07 TO 09 73 New Generation 1977 54 years 05 months 54.4 4904.1 358000
2022 1 ANG MO KIO 3 ROOM 225 ANG MO KIO AVE 1 07 TO 09 67 New Generation 1978 55 years 01 month 55.1 5298.5 355000
2022 1 ANG MO KIO 3 ROOM 331 ANG MO KIO AVE 1 07 TO 09 68 New Generation 1981 58 years 58.0 4970.6 338000
2022 1 ANG MO KIO 3 ROOM 534 ANG MO KIO AVE 10 07 TO 09 82 New Generation 1980 57 years 02 months 57.2 5122.0 420000
2022 1 ANG MO KIO 3 ROOM 578 ANG MO KIO AVE 10 04 TO 06 67 New Generation 1980 57 years 01 month 57.1 4895.5 328000
2022 1 ANG MO KIO 3 ROOM 452 ANG MO KIO AVE 10 01 TO 03 83 New Generation 1979 56 years 07 months 56.6 4337.3 360000

3. Visualisation

3.1 Exploratory Data Visualisation

The plots here are preliminary in nature but designed with interactivity to allow users to perform Exploratory Data Analysis (EDA) Visualisation to study the data.

3.1.1. Interactive scatterplot

Design Consideration

Scatterplots are generally used to discover relationship between two continuous variables. As such, the visualization below allows users to select the x-axis and y-axis of the continuous variables they wish to study. Considerations :

  • y-axis selection is resale_price and resale_price_persqm. This is aligned with the study purpose of discovering patterns of resale price. This allows users to also see the intent of normalising resale_price by floor_area_sqm

  • x-axis selection is other continuous variables, namely: remaining_lease_years, lease_commence_date, and floor_area_sqm

  • As the plots are expected to be very scattered, opacity is introduced with white border

  • Tooltip indicating variables of interest : remaining_lease_years, floor_area_sqm, and resale_price_persqm

Preparation of visualisation

plot_ly is used to prepare the interactive plot. Steps taken are

  • Initiating base scatterplot, indicated by type = 'scatter'. marker argument is used to introduce opacity and line (white plot border)

  • Tooltip customisation is created using hovertemplate argument

  • layout argument is used to add plot title, x-axis title, and y-axis title

  • To create the dropdown menu for parameters of x-axis and y-axis, the updatemenus argument is used to create respective buttons

Refer to code below for more details

Show the code
#Initiating the base plot
plot_ly(data = flatpriceclean,
        x = ~remaining_lease_years,
        y = ~resale_price_persqm,
        hovertemplate = ~paste("<br>Resale Price per sqm:", resale_price_persqm,
                               "<br>Floor Area (sqm):", floor_area_sqm,
                               "<br>Remaining Lease (Year):", remaining_lease_years),
        type = 'scatter',
        mode = 'markers',
        marker = list(opacity = 0.6,
                      sizemode = 'diameter',
                      line = list(width = 0.2, color = 'white'))) |> 

#Generating plot, x-axis, and y-axis title
  layout(title = "Interactive scatterplot of resale price vs other factors\nResale transactions, 2022",
         xaxis = list(title = "Remaining Lease (Year)"),
         yaxis = list(title = "Resale Price per sqm (SGD)"),
         
#creating dropwdown menus to allow selection of parameters on x-axis and y-axis 
         updatemenus = list(list(type = "dropdown",
                                 direction = "up",
                                 xref = "paper",
                                 yref = "paper",
                                 xanchor = "left",
                                 yanchor = "top",
                                 x = 1,
                                 y = 0,
                                 buttons = list(
                                   list(method = "update",
                                        args = list(list(x = list(flatpriceclean$remaining_lease_years)),
                                                    list(xaxis = list(title = "Remaining Lease (Year)"))),
                                        label = "Remaining Lease"),
                                   list(method = "update",
                                        args = list(list(x = list(flatpriceclean$lease_commence_date)),
                                                    list(xaxis = list(title = "Year of Lease Commenced"))),
                                        label = "Lease Commenced"),
                                   list(method = "update",
                                        args = list(list(x = list(flatpriceclean$floor_area_sqm)),
                                                    list(xaxis = list(title = "Floor Area (sqm)"))),
                                        label = "Floor Area")
                                   )
                                 ),
                            
                            list(type = "dropdown",
                                 xref = "paper",
                                 yref = "paper",
                                 xanchor = "left",
                                 yanchor = "top",
                                 x = 0.04,
                                 y = 0.95,
                                 buttons = list(
                                   list(method = "update",
                                        args = list(list(y = list(flatpriceclean$resale_price_persqm)),
                                                    list(yaxis = list(title = "Resale Price per sqm (SGD)"))),
                                        label = "Resale Price/Area"),
                                   list(method = "update",
                                        args = list(list(y = list(flatpriceclean$resale_price)),
                                                    list(yaxis = list(title = "Resale Price (SGD)"))),
                                        label = "Resale Price")
                                   )
                                 )
                            )
         )

Insights

  • The most obvious pattern is the direct positive correlation between resale price and floor area. Hence the decision to normalize the resale price is justified. From here onwards, we will focus more on resale price per sqm.

  • Plotting resale price per sqm vs remaining lease reveals somewhat positive exponential correlation. There seems to sharper increase in gradient if the remaining lease is beyond 80 years

  • As expected, the same pattern is revealed when resale price per sqm is plotted against the year of lease commenced. In fact the year of lease commenced and remaining lease are directly correlated to each other as shown in the plot below. Hence it makes sense to use one of them in future analysis. In this case, we will only use remaining lease as it is more intuitive.

Show the code
ggplot(data = flatpriceclean)+

  geom_point(aes(x = lease_commence_date,
                 y = remaining_lease_years)) +
  
  labs(
    x = "Year of Lease Commenced",
    y = "Remaining Lease\n(Year)") +

  theme(axis.title.y = element_text(angle = 0))

3.1.2. Interactive boxviolin plot

Design Consideration

Scatterplots are generally used to discover relationship between two continuous variables. As such, the visualization below allow

Show the code
plot_ly(data = flatpriceclean,
        x = ~flat_type,
        y = ~resale_price_persqm,
        type = "violin",
        alpha = 0.3,
        marker = list(opacity = 0.6),
        box = list(visible = T),
        meanline = list(visible = T)) |> 
  
  layout(title = "Distribution of resale price by selected factors, \nResale transactions, 2022",
         xaxis = list(title = ""),
         yaxis = list(title = "Resale Price per sqm (SGD)"),
         updatemenus = list(list(type = 'dropdown',
                                 xref = "paper",
                                 yref = "paper",
                                 xanchor = "left",
                                 x = 0.04, 
                                 y = 0.95,
                                 buttons = list(
                                   list(method = "update",
                                        args = list(list(x = list(flatpriceclean$flat_type)),
                                                    list(xaxis = list(categoryorder = "category ascending"))),
                                        label = "Flat Type"),
                                   list(method = "update",
                                        args = list(list(x = list(flatpriceclean$flat_model)),
                                                    list(xaxis = list(categoryorder = "mean ascending"))),
                                        label = "Flat Model"),
                                   list(method = "update",
                                        args = list(list(x = list(flatpriceclean$storey_range)),
                                                    list(xaxis = list(categoryorder = "category ascending"))),
                                        label = "Storey Height"),
                                   list(method = "update",
                                        args = list(list(x = list(flatpriceclean$town)),
                                                    list(xaxis = list(categoryorder = "mean ascending"))),
                                        label = "Town"),
                                   list(method = "update",
                                        args = list(list(x = list(flatpriceclean$month)),
                                                    list(xaxis = list(tickmode = "array")),
                                                    list(color = list(flatpriceclean$month))),
                                        label = "Transaction Month")
                              
                                   )
                                 )
                            )
         )

3.2 Confirmatory Data Analysis Visualization

The first plot is to investigate other factors that might impact the resale price.

Show the code
ggbetweenstats(
  data = flatpriceclean,
  x = flat_type, 
  y = resale_price_persqm, 
  xlab = "Types of Flat (Rooms)",
  ylab = "Resale Price per sqm (SGD)",
  palette = "Paired",
  title = "One-way ANOVA analysis reveals at least one significant difference in 2022 resale price across different flat types",
  type = "np", 
  pairwise.comparisons = TRUE,
  pairwise.display = "ns", 
  mean.ci = TRUE, 
  p.adjust.method = "fdr", 
  messages = FALSE 
  ) 

The second plot is to investigate other factors that might impact the resale price.

Show the code
ggbetweenstats(
  data = flatpriceclean |> 
    mutate(storey_range = ifelse(storey_range %in% c("40 TO 42", "43 TO 45", "46 TO 48", "49 TO 51"), "40+", storey_range)), 
  x = storey_range, 
  y = resale_price_persqm, 
  xlab = "Storey Height",
  ylab = "Resale Price per sqm (SGD)",
  palette = "Paired",
  title = "One-way ANOVA analysis reveals at least one significant difference in 2022 resale price across different storeys",
  type = "np", 
  pairwise.comparisons = TRUE,
  pairwise.display = "ns", 
  mean.ci = TRUE, 
  p.adjust.method = "fdr", 
  messages = FALSE 
  ) 

Thirdly, check the flat_model variables. Filtering for number of observations >= 50

Show the code
flatpriceclean$flat_model <- fct_reorder(flatpriceclean$flat_model, flatpriceclean$resale_price_persqm)

ggbetweenstats(
  data = flatpriceclean |> 
    group_by(flat_model) |> 
    filter(n() >= 50),
  x = flat_model, 
  y = resale_price_persqm, 
  xlab = "Flat Model",
  ylab = "Resale Price per sqm (SGD)",
  palette = "Paired",
  title = "One-way ANOVA analysis reveals at least one significant difference in 2022 resale price across different models",
  type = "np", 
  pairwise.comparisons = TRUE,
  pairwise.display = "ns",  
  mean.ci = TRUE, 
  p.adjust.method = "fdr", 
  messages = FALSE 
  ) 

The town variables are skipped as there are too many variables -> to be considered in the final visualization

Lastly, check the transaction month variables

Show the code
ggbetweenstats(
  data = flatpriceclean,
  x = month, 
  y = resale_price_persqm, 
  xlab = "Month of Transaction",
  ylab = "Resale Price per sqm (SGD)",
  palette = "Paired", 
  title = "One-way ANOVA analysis reveals at least one significant difference in 2022 resale price across different \ntransaction months",
  type = "np", 
  pairwise.comparisons = TRUE,
  pairwise.display = "ns",   
  mean.ci = TRUE, 
  p.adjust.method = "fdr", 
  messages = FALSE 
  ) 

3.3 Visualization of Resale Price by Township

Show the code
town_list <- list()
for (i in 1:length(unique(flatpriceclean$town))) { 
  town_list[[i]] <- list(method = "restyle",
                         args = list("transforms[0].value",
                                     unique(flatpriceclean$town)[i]),
                         label = unique(flatpriceclean$town)[i])
  }

annot <- list(list(text = "Select Towns:",
                   x = 1.41,
                   y = 0.78,
                   xref = 'paper',
                   yref = 'paper',
                   showarrow = FALSE))
Show the code
flatpriceorder <- flatpriceclean[order(flatpriceclean$flat_type), ]

plot_ly(data = flatpriceclean,
        x = ~remaining_lease_years,
        y = ~resale_price_persqm,
        hovertemplate = ~paste("<br>Resale Price per sqm:", resale_price_persqm,
                               "<br>Floor Area (sqm):", floor_area_sqm,
                               "<br>Remaining Lease (Year):", remaining_lease_years,
                               "<br>Town:", town),
        type = 'scatter',
        mode = 'markers',
        size = ~floor_area_sqm,
        sizes = c(5, 15),
        color = ~factor(flat_type),
        marker = list(opacity = 0.6,
                      sizemode = 'diameter',
                      line = list(width = 0.2, color = '#FFFFFF')),
        transforms = list(list(type = 'filter',
                               target = ~flatpriceorder$town,
                               operation = '=',
                               value = unique(flatpriceorder$town)[1])
                          )
        ) |> 
  
  layout(title = "Resale Price per flat area increases with remaining lease \nResale transactions by towns, 2022",
         xaxis = list(title = "Remaining Lease (Year)",
                      range = c(40, 100)),
         yaxis = list(title = "Resale Price per sqm (SGD)",
                      range = c(3000, 16000)),
         updatemenus = list(list(type = 'dropdown',
                                 xref = "paper",
                                 yref = "paper",
                                 x = 1.4, y = 0.7,
                                 buttons = town_list)
                            ),
         
         annotations = annot
         )